WHAT IS IT?
This model implements Q-learning (Watkins 1989), a one-step temporal-difference algorithm from the field of reinforcement learning.
HOW IT WORKS
The agent (a strike aircraft, shown in blue) senses the state of the game in the form of its health, its distances to the SAM sites, and its remaining weapons. After sensing the state and receiving a reward, the agent chooses one of 8 actions that manipulate the state space, such as evading left or right, flying toward a SAM, or firing a weapon at a SAM. The following Q-learning update is used:
Q(s,a) = Q(s,a) + step-size * [reward + discount * max(Q(s',a')) - Q(s,a)]
The agent keeps making moves until it runs out of weapons, dies, or kills the 'target' SAM site. The rewards are -2 points for using a weapon, -200 points for dying, and +1000 points for killing the 'target' SAM. The agent also has the option of turning on stealth technology, which keeps it from being seen by the SAM sites.
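A minimal sketch of how this update could be written in NetLogo, assuming the Q-table is kept with the table extension and that states and actions are encoded as strings; the procedure and variable names below are illustrative, not the model's actual code:

  extensions [table]

  globals [q-table step-size discount]   ;; step-size and discount are sliders in the model's interface

  to setup-learning
    set q-table table:make               ;; maps "state,action" keys to Q-values
  end

  to-report q-value [state action]
    ;; unvisited state-action pairs default to a Q-value of 0
    let key (word state "," action)
    ifelse table:has-key? q-table key
      [ report table:get q-table key ]
      [ report 0 ]
  end

  to update-q [state action reward new-state actions]
    ;; one-step Q-learning update, as in the formula above
    let best-next max map [ a -> q-value new-state a ] actions
    let old-q q-value state action
    table:put q-table (word state "," action)
      (old-q + step-size * (reward + discount * best-next - old-q))
  end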
HOW TO USE IT
The buttons and sliders control the setup and all of the parameters inside the algorithm. The plot shows the average reward obtained per episode. The step-size parameter controls how far old Q-values are moved toward new values. Discount is the present value of future rewards. Exploration-% is the fraction of moves in which the agent moves to a non-optimal patch, which helps the agent explore new tactics and avoid getting stuck in local optima.
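For example, an epsilon-greedy choice driven by the exploration-% slider might look like the following sketch; it assumes exploration-% runs from 0 to 100 and reuses the q-value reporter from the sketch above, and it is not the model's actual code:

  to-report choose-action [state actions]
    ;; with probability exploration-%, explore a random action;
    ;; otherwise exploit the action with the highest current Q-value
    ifelse random-float 100 < exploration-%
      [ report one-of actions ]
      [ report first sort-by [ [a b] -> q-value state a > q-value state b ] actions ]
  end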
THINGS TO NOTICE
The average reward in the plot increases with the number of episodes the agent has trained on, which shows the agent's learning process. With stealth technology enabled, does the agent adopt different tactics?
THINGS TO TRY
Experiment with the algorithm parameters such as step-size, discount, and exploration-%. Also, investigate the environmental parameters.
EXTENDING THE MODEL
Implement different reward schemes that encourage more direct and optimal paths, such as -1 point for every move the agent makes, forcing the agent to find a more direct approach to the 'target' SAM. Add a more robust exploration routine. The model is set up for multi-agent learning; however, more advanced cooperation-versus-self-interest algorithms would need to be implemented to cope with the unstable environment that multi-agent learning can cause.
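For instance, a per-move penalty could be expressed as a small reward reporter along these lines (agent-dead? and target-killed? are hypothetical reporters standing in for the model's own checks):

  to-report move-reward
    ;; -1 per move pushes the agent toward shorter, more direct attack paths
    if agent-dead?    [ report -200 ]
    if target-killed? [ report 1000 ]
    report -1
  end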
TROUBLESHOOTING
This model requires an external file ("agent.rtf") to store the learned tactics. If an error occurs in "LOAD-STATE-ACTION-FILE", click the "Clear/Create File" button; the "agent.rtf" file will be created and will work as long as the model has permission to write to the directory where it is stored.
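A sketch of what the file-creation step could look like with NetLogo's file primitives (the model's actual "Clear/Create File" procedure may differ):

  to clear-create-file
    ;; remove any old file, then create a fresh, empty agent.rtf
    if file-exists? "agent.rtf" [ file-delete "agent.rtf" ]
    file-open "agent.rtf"
    file-print ""    ;; the first write puts the file in writing mode and creates it
    file-close
  end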
CREDITS AND REFERENCES
Written by Joe Roop (Spring 2006): Joseph.Roop@asdl.gatech.edu
References:
Watkins, C.J.C.H. (1989). Learning from Delayed Rewards. PhD thesis, King's College, University of Cambridge.